Fixing memory ordering issue in ConcurrentQueue #78142

Merged 1 commit into dotnet:main on Nov 10, 2022
Conversation

@VSadov (Member) commented Nov 10, 2022

Fixes: #76501

@ghost commented Nov 10, 2022

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes: #76501

Author: VSadov
Assignees: VSadov
Labels: area-System.Collections
Milestone: -

@VSadov (Member, Author) commented Nov 10, 2022

I have been running the repro in multiple processes for about 20 minutes now with no failures. It typically fails in under a minute.

@VSadov (Member, Author) commented Nov 10, 2022

No failures for an hour. So it looks like the scenario has no other issues.

@VSadov (Member, Author) commented Nov 10, 2022

Thanks @filipnavara for the repro scenario!

@filipnavara (Member)

Thanks for the fix, LGTM. It should probably be backported to 7.0.

@stephentoub (Member) left a comment

Thanks. Nice find.

@stephentoub (Member)

> It should probably be backported to 7.0.

I'm fine with that; ConcurrentQueue is at the heart of the system due to its usage in ThreadPool. Just note this isn't new; it's been this way since I missed adding the volatile in 2016 :)
https://github.com/dotnet/corefx/pull/14254/files#diff-c12f070bb2033f631c1e51c214b49d678ecb1c871b24652213654486f8cb187bR567
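
For context, the sketch below illustrates the kind of acquire-ordering gap a missing volatile read creates when snapshotting a ConcurrentQueue-style segment. The type and member names are hypothetical and simplified; this is not the actual Segment code from dotnet/runtime, just a sketch of the pattern.

```csharp
using System.Threading;

struct Slot<T>
{
    public T Item;
    public int SequenceNumber; // written by the producer to "publish" the slot
}

static class SnapshotSketch
{
    // Buggy pattern: a plain read of SequenceNumber lets a weakly ordered CPU
    // (e.g. arm64) satisfy the load of Item before the sequence check, so the
    // snapshot can see a default/stale Item even though the check succeeded.
    public static T ReadPublishedItemBuggy<T>(Slot<T>[] slots, int i, int expectedSequence)
    {
        while (slots[i].SequenceNumber != expectedSequence)
        {
            Thread.SpinWait(1);
        }
        return slots[i].Item; // this load may be reordered ahead of the check above
    }

    // Fixed pattern: Volatile.Read has acquire semantics, so the load of Item
    // cannot move ahead of the observation that the slot was published.
    public static T ReadPublishedItemFixed<T>(Slot<T>[] slots, int i, int expectedSequence)
    {
        while (Volatile.Read(ref slots[i].SequenceNumber) != expectedSequence)
        {
            Thread.SpinWait(1);
        }
        return slots[i].Item;
    }
}
```

The producer side is assumed to write Item first and then publish SequenceNumber with release semantics (for example via Volatile.Write); without the matching acquire read on the snapshot side, that ordering is not sufficient on weak hardware.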

@VSadov (Member, Author) commented Nov 10, 2022

Thanks!!

@VSadov VSadov merged commit fee6181 into dotnet:main Nov 10, 2022
@VSadov VSadov deleted the cqueue1 branch November 10, 2022 17:02
@VSadov (Member, Author) commented Nov 10, 2022

The issue is on the ToArray/CopyTo/Enumerate path. I think it would be unusual to do such operations while the queue is being mutated. Any kind of synchronization that guarantees quiescence will likely make this issue disappear.

Also, it requires a weak memory architecture and hardware that exploits that weakness aggressively. Missing write fences often cause trouble because write buffering is common; missing read fences require that the CPU speculates reads far ahead, which is probably still not very common. Evidently, we have only seen this issue on M1.

I think the chances of this causing problems in actual programs are low. On the other hand, it is possible, the fix is very low risk, and 7.0 may see a bigger share of weak architectures than prior releases.

I think it may be worth porting.
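
To make the failure mode concrete, here is a rough stress sketch of snapshotting while the queue is being mutated, in the spirit of the test named in the linked issue. It is not the actual repro from #76501 or the runtime's test; the timeout and the null check are illustrative assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class ToArrayStressSketch
{
    static void Main()
    {
        var queue = new ConcurrentQueue<object>();
        using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(1));

        // Keep enqueues and dequeues churning so segments are constantly mutated.
        Task mutator = Task.Run(() =>
        {
            while (!cts.IsCancellationRequested)
            {
                queue.Enqueue(new object());
                queue.TryDequeue(out _);
            }
        });

        // Snapshot concurrently. With the missing acquire read, a weakly ordered
        // CPU could let ToArray observe a slot as committed while still returning
        // the value it held before the enqueue's item write became visible (null here).
        while (!cts.IsCancellationRequested)
        {
            foreach (object item in queue.ToArray())
            {
                if (item is null)
                {
                    throw new InvalidOperationException("Snapshot observed an unpublished slot.");
                }
            }
        }

        mutator.Wait();
    }
}
```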

@VSadov (Member, Author) commented Nov 10, 2022

I also think this issue may become a test/stress nuisance in 7.0.

It has been quite annoying for NativeAOT on OSX lately, as it failed fairly often.

@VSadov (Member, Author) commented Nov 10, 2022

/backport to release/7.0

@github-actions (Contributor)

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3439145097

Successfully merging this pull request may close these issues.

ManyConcurrentAddsTakes_ForceContentionWithToArray fails intermittently on osx-arm64